Analysis of genre popularity based on the Kaggle platform's database 'Spotify HUGE database - daily charts over 3 years (2017-2020)'¶

Contributors:
Dominika Gerszewska
Marcin Sidoruk
Andrzej Łososowski
Joanna Zielińska

Kaggle dataset:¶

Source: https://www.kaggle.com/datasets/pepepython/spotify-huge-database-daily-charts-over-3-years?select=Final+database

This database contains the 200 most-streamed songs on Spotify for each day over a period of more than three years.
It includes a wealth of information for each track, gathered via Spotify's API, such as the artist, country, genre, and other relevant details.
To simplify the data, the popularity of each song has been aggregated into a single score.
This Spotify database is a valuable resource for anyone interested in music or data analysis.

Goal of the project:¶

The goal of this project is to explore and analyze Spotify's daily top 200 streaming songs data over a period of three years.
The project includes a variety of visualizations and analyses, such as identifying the most popular music genres, creating a map of average popularity, analyzing popularity by language, examining musical diversity, and identifying the most frequently occurring genres by country.
Additionally, the project includes a feature that allows users to input a specific genre and view the top 10 countries where that genre is most popular.

The business objective of this project is to provide valuable insights for artists and music industry professionals who are looking to understand music trends and identify opportunities for market entry.
For example, an artist could use this information to determine which countries to target when promoting their music based on the popularity of their genre in different regions.
Similarly, music industry professionals could leverage this data to make informed decisions about marketing and distribution strategies.

In [1]:
#requirements

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore') 
%matplotlib inline
pd.set_option('display.max_columns', 151)
In [2]:
# Loading data #1

#Importing the database with selected columns
df = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Popularity', 'Genre'])
df_1 = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Genre', 'Artist','Title','Album','Cluster','Popularity','Artist_followers'])
In [3]:
# Loading data #2
# Adding an extra dataset so plotly can resolve country names to ISO codes

df_country_iso = pd.read_csv('Country_ISO/countries_codes_and_coordinates.csv')
df_country_iso = df_country_iso.replace('"', '', regex=True)  # strip stray double quotes from the values
df_country_iso = df_country_iso.replace('United Kingdom', 'UK')  # match the country name used in the Spotify dataset
In [4]:
# Loading data #3

# Creating a dictionary that maps each country name to its 3-letter ISO code
kraj = list(df_country_iso['Country'])  # country names from the ISO table
iso = list(df_country_iso['Alpha-3 code'])  # 3-letter country codes from the ISO table
iso = [x.strip(' ') for x in iso]  # remove spaces around the codes
country_to_iso = {}  # named so that it does not shadow the built-in dict
for i, j in zip(kraj, iso):  # build the dictionary used to fill the iso_alpha column of df
    country_to_iso.setdefault(i, j)
In [5]:
# Loading data #4

df['iso_alpha'] = df['Country']  # add an iso_alpha column holding the Country values, to be swapped for 3-letter codes

df.replace({"iso_alpha": country_to_iso}, inplace=True)  # replace each iso_alpha value with its 3-letter code, which plotly needs to locate the countries
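A country whose name is missing from the ISO table keeps its full name in `iso_alpha`, and plotly then cannot place it on the map. The sketch below shows one way to list such names before plotting. The helper `find_unmapped` and the toy values are hypothetical and only stand in for `df['Country']` and the name-to-code dictionary built above.

```python
import pandas as pd


def find_unmapped(countries: pd.Series, name_to_iso: dict) -> list:
    """Return the sorted country names that have no entry in name_to_iso."""
    unique_names = pd.Series(countries.unique())
    missing = unique_names[~unique_names.isin(list(name_to_iso))]
    return sorted(missing)


# Toy data (assumption): not taken from the Spotify dataset
toy_countries = pd.Series(["Poland", "UK", "Poland", "Czech Republic"])
toy_mapping = {"Poland": "POL", "UK": "GBR"}

unmapped = find_unmapped(toy_countries, toy_mapping)
print(unmapped)
```

An empty list would mean every country name was translated to a 3-letter code.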

Data exploration and identification of basic issues¶

In [7]:
df_1.head()
Out[7]:
       Country  Popularity       Title        Artist              Genre  Artist_followers       Album                      Cluster
0       Global    31833.95  adan y eva  Paulo Londra  argentine hip hop        11427104.0  Adan y Eva                       global
1          USA        8.00  adan y eva  Paulo Londra  argentine hip hop        11427104.0  Adan y Eva  english speaking and nordic
2    Argentina    76924.40  adan y eva  Paulo Londra  argentine hip hop        11427104.0  Adan y Eva             spanish speaking
3      Belgium      849.60  adan y eva  Paulo Londra  argentine hip hop        11427104.0  Adan y Eva  english speaking and nordic
4  Switzerland    20739.10  adan y eva  Paulo Londra  argentine hip hop        11427104.0  Adan y Eva  english speaking and nordic
In [8]:
df_1.info(show_counts=False)  # show_counts replaces the deprecated null_counts argument (pandas 1.2+)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170633 entries, 0 to 170632
Data columns (total 8 columns):
 #   Column            Dtype  
---  ------            -----  
 0   Country           object 
 1   Popularity        float64
 2   Title             object 
 3   Artist            object 
 4   Genre             object 
 5   Artist_followers  object 
 6   Album             object 
 7   Cluster           object 
dtypes: float64(1), object(7)
memory usage: 10.4+ MB
In [9]:
# Data cleansing

# Treat the 'n-a' placeholder as missing and drop incomplete rows
df = df.replace('n-a', np.nan)
df = df.dropna()
df_1 = df_1.replace('n-a', np.nan)
df_1 = df_1.dropna()
# Drop the worldwide aggregate rows so that only individual countries remain
drop_index_cl = df_1[df_1.Cluster == 'global'].index
drop_index_c = df[df.Country == 'Global'].index
df.drop(drop_index_c,inplace=True)
df_1.drop(drop_index_cl,inplace=True)
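The `info()` output lists `Artist_followers` with dtype `object`, so the column is stored as text and would need converting before any numeric use. The analysis in this notebook does not rely on it, so the following is only an optional sketch. The helper `followers_to_numeric` and the toy frame are hypothetical.

```python
import pandas as pd


def followers_to_numeric(frame: pd.DataFrame, column: str = "Artist_followers") -> pd.DataFrame:
    """Return a copy with `column` converted to float; unparseable rows are dropped."""
    out = frame.copy()
    # errors="coerce" turns values such as 'n-a' into NaN instead of raising
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out.dropna(subset=[column])


# Toy data (assumption): mimics a text column holding numbers and an 'n-a' placeholder
toy = pd.DataFrame({
    "Artist": ["A", "B", "C"],
    "Artist_followers": ["11427104.0", "n-a", "250"],
})

cleaned = followers_to_numeric(toy)
print(cleaned.dtypes["Artist_followers"])
```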

Table of unique values¶

In [10]:
Countries = df_1['Country'].nunique()
Genres = df_1['Genre'].nunique()
Titles = df_1['Title'].nunique()
Albums = df_1['Album'].nunique()
Artist = df_1['Artist'].nunique()

df_unique = pd.DataFrame({'Countries': [Countries], 'Genres': [Genres], 'Artist': [Artist], 'Albums': [Albums], 'Title': [Titles]})
df_unique.style.hide(axis='index')  # Styler.hide replaces the deprecated hide_index (pandas 1.4+)
Out[10]:
Countries Genres Artist Albums Title
34 1119 23347 32633 44930
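The same one-row table can also be produced in a single step by calling `nunique()` on the selected columns. This is a minimal sketch on toy data. The helper name `unique_counts` is hypothetical, and the resulting column labels are the original column names rather than the plural labels used above.

```python
import pandas as pd


def unique_counts(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a one-row table with the number of distinct values per column."""
    return frame[columns].nunique().to_frame().T


# Toy data (assumption): not taken from the Spotify dataset
toy = pd.DataFrame({
    "Country": ["Poland", "Poland", "Spain"],
    "Genre": ["pop", "rap", "pop"],
})

table = unique_counts(toy, ["Country", "Genre"])
print(table)
```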

Data analysis¶

Map of mean popularity¶

In [11]:
# Mean popularity per country (keyed by ISO alpha-3 code), used for the choropleth map
by_country = df.groupby('iso_alpha')['Popularity'].mean().reset_index().rename(columns={'iso_alpha': 'Country','Popularity':'Mean Popularity'})
In [12]:
# Number of chart entries and mean popularity per country and language cluster

uniq = df_1.groupby(['Country','Cluster'])['Popularity'].count().reset_index().sort_values(by = 'Country')
country = df_1.groupby(['Country','Cluster'])['Popularity'].mean().reset_index().rename(columns={'Popularity':'Mean_Popularity'}).sort_values(by = 'Country')
uniq['Mean_Popularity'] = country['Mean_Popularity']
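The count and the mean can also be computed in one pass with pandas named aggregation, which avoids relying on index alignment between two separately built frames. This is a sketch on toy data. The helper `summarize_by_country` and the column name `Count` are assumptions (the cell above keeps the count in a column called `Popularity`).

```python
import pandas as pd


def summarize_by_country(frame: pd.DataFrame) -> pd.DataFrame:
    """Count chart entries and average Popularity per (Country, Cluster)."""
    return (
        frame.groupby(["Country", "Cluster"])["Popularity"]
        .agg(Count="count", Mean_Popularity="mean")
        .reset_index()
        .sort_values("Country")
    )


# Toy data (assumption): popularity values are made up
toy = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "USA"],
    "Cluster": ["spanish speaking", "spanish speaking", "english speaking and nordic"],
    "Popularity": [10.0, 30.0, 5.0],
})

summary = summarize_by_country(toy)
print(summary)
```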
In [13]:
# Choropleth map of mean popularity per country, followed by a bar chart of chart-entry counts

country_list = by_country
fig = px.choropleth(country_list, locations='Country',
                        color='Mean Popularity', # color each country by its mean popularity
                        hover_name='Country', # column to add to hover information
                        color_continuous_scale=px.colors.sequential.Rainbow,
                        width=600,
                        height=600,
                        projection = 'mercator')
fig.update_layout(title='Mean popularity by country')
fig.show()

uniq = uniq.sort_values(ascending=False, by = 'Popularity')
fig = px.bar(uniq,x=uniq.Country,
            y=uniq.Popularity,
            labels={'Country':'Country', 'Popularity':'The number of occurrences'},    
            color = 'Mean_Popularity',            
            color_continuous_scale = px.colors.sequential.Rainbow)
fig.update_layout(title='Number of songs that appeared in the top 200 in each country')
fig.update_traces(width=0.4)
fig.show()

Distribution by language¶

In [14]:
fig = px.sunburst(country, 
                  path=['Cluster','Country'], 
                  values='Mean_Popularity',
                  color='Mean_Popularity', 
                  color_continuous_scale=px.colors.sequential.Rainbow,
                  width = 600,
                  height = 800,
                  title= 'Distribution by language'
                 )
fig.show()

Mean popularity in each country for language cluster¶

In [15]:
country_en = country[country.Cluster == 'english speaking and nordic'].sort_values(ascending=False, by = 'Mean_Popularity')
country_spanish = country[country.Cluster == 'spanish speaking'].sort_values(ascending=False, by = 'Mean_Popularity')
country_portuguese = country[country.Cluster == 'southern europe and portuguese heritage'].sort_values(ascending=False, by = 'Mean_Popularity')

fig = make_subplots(rows=1, cols=3, subplot_titles=("Spanish speaking", "Southern Europe and Portuguese heritage", "English speaking and Nordic"), shared_yaxes=True, horizontal_spacing=0.1)
fig.add_trace(go.Bar(x=country_en.Country, y=country_en.Mean_Popularity), row=1, col=3)
fig.add_trace(go.Bar(x=country_spanish.Country, y=country_spanish.Mean_Popularity), row=1, col=1)
fig.add_trace(go.Bar(x=country_portuguese.Country, y=country_portuguese.Mean_Popularity), row=1, col=2)
fig.update_layout(height=400, width=1000,
                  title_text="Mean popularity in each country for language cluster", showlegend=False, yaxis_title='Mean popularity', xaxis_title='Country')
fig.update_traces(width=0.4)
fig.show()

Number of different genres per country¶

In [16]:
count_genre2 = df.groupby('Country')['Genre'].nunique().sort_values(ascending= False)
fig = px.bar(x=count_genre2.index, y=count_genre2.values, labels={'x':'Country', 'y':'The number of different genres'})
fig.update_layout(title='Number of different genres per country')
fig.update_traces(width=0.4)
fig.show()

Top 10 most popular music genres¶

In [17]:
genre_counts = df['Genre'].value_counts().nlargest(10)
fig = px.bar(x=genre_counts.index, y=genre_counts, labels={'x':'Genre', 'y':'The number of songs'})
fig.update_layout(title='Top 10 most popular music genres')
fig.update_traces(width=0.4)
fig.show()

The most common genre in each country¶

In [18]:
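result = df.groupby('Country')['Genre'].apply(lambda x: x.value_counts().nlargest(1)).sort_values(ascending=True).reset_index(name='Counts')
# Depending on the pandas version, the genre level comes out as 'level_1' or 'Genre'; map either one to 'Legend'
result.rename(columns = {'level_1' : 'Legend', 'Genre' : 'Legend'}, inplace=True)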
wykres2 = px.bar(result, y='Country', x='Counts', color='Legend', orientation='h')
wykres2.update_traces(textposition='inside',width =0.4)
wykres2.update_layout(xaxis_title='Number of occurrences',
                  yaxis_title='Country',
                  height=800)
wykres2.update_layout(title='The most common genre in each country')
wykres2.show()
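The most common genre per country can also be found without `apply`, by counting (Country, Genre) pairs and keeping the largest count in each country. This is a minimal sketch on toy data. The helper `top_genre_per_country` is hypothetical, and ties between genres are not resolved in any particular order.

```python
import pandas as pd


def top_genre_per_country(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per country with its most frequent genre and the count."""
    counts = frame.groupby(["Country", "Genre"]).size().reset_index(name="Counts")
    # Within each country put the largest count first, then keep only that row
    return (
        counts.sort_values(["Country", "Counts"], ascending=[True, False])
        .drop_duplicates("Country")
        .reset_index(drop=True)
    )


# Toy data (assumption): not taken from the Spotify dataset
toy = pd.DataFrame({
    "Country": ["Poland", "Poland", "Poland", "Spain"],
    "Genre": ["pop", "pop", "rap", "latin"],
})

top = top_genre_per_country(toy)
print(top)
```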

Top 10 most popular music genres in selected countries¶

In [19]:
poland_counts = df.query('Country == "Poland"')['Genre'].value_counts().nlargest(10)
turkey_counts = df.query('Country == "Turkey"')['Genre'].value_counts().nlargest(10)
ecuador_counts = df.query('Country == "Ecuador"')['Genre'].value_counts().nlargest(10)
fig = make_subplots(rows=1, cols=3, subplot_titles=("Poland", "Turkey", "Ecuador"), shared_yaxes=True)
fig.add_trace(go.Bar(x=poland_counts.index, y=poland_counts), row=1, col=1)
fig.add_trace(go.Bar(x=turkey_counts.index, y=turkey_counts), row=1, col=2)
fig.add_trace(go.Bar(x=ecuador_counts.index, y=ecuador_counts), row=1, col=3)
fig.update_layout(height=400, width=1000, title_text="Top 10 most popular music genres in selected countries", showlegend=False, yaxis_title='Number of occurrences', xaxis_title='Genre')
fig.show()

Number of occurrences in countries for selected genre¶

In [20]:
# Display the countries where the selected genre appears, ordered by number of occurrences
wprowadzony_gatunek = input("Please enter the name of the music genre for which you want to see ordered countries by count: ")
nowy_df = df.loc[df['Genre'] == wprowadzony_gatunek, ['Genre', 'Country','iso_alpha']]
# Count occurrences per country; reset_index(name=...) names the count column the same way across pandas versions
top_counts = nowy_df[['iso_alpha','Country']].value_counts().reset_index(name='Counts')
# Plotting a map of the selected countries
country_list = top_counts
fig = px.choropleth(country_list, locations="iso_alpha",
                        color="Counts", # color each country by its number of occurrences
                        hover_name="Country", # column to add to hover information
                        color_continuous_scale=px.colors.sequential.Rainbow,
                        width=800,
                        height=800,
                        projection = 'mercator')
fig.show()

# Plotting a bar chart of the selected countries
fig = px.bar(top_counts, x='Country', y='Counts', labels={'Counts': 'Number of occurrences'})
fig.update_layout(title=f"Count in countries for selected genre ({wprowadzony_gatunek})")
fig.show()
Please enter the name of the music genre for which you want to see ordered countries by count: k-pop
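The cell above matches the typed genre exactly and plots every country in which it occurs. A variant that tolerates differences in case and surrounding spaces, and that keeps only the first `n` countries, might look like the sketch below. The helper `top_countries_for_genre`, its parameter `n`, and the toy data are assumptions rather than part of the notebook.

```python
import pandas as pd


def top_countries_for_genre(frame: pd.DataFrame, genre: str, n: int = 10) -> pd.DataFrame:
    """Return the n countries where `genre` occurs most often, with counts."""
    wanted = genre.strip().lower()
    mask = frame["Genre"].str.lower() == wanted
    # value_counts sorts in descending order, so head(n) keeps the top n
    counts = frame.loc[mask, "Country"].value_counts().head(n)
    return counts.rename_axis("Country").reset_index(name="Counts")


# Toy data (assumption): not taken from the Spotify dataset
toy = pd.DataFrame({
    "Genre": ["k-pop", "k-pop", "k-pop", "pop"],
    "Country": ["Japan", "Japan", "Poland", "Poland"],
})

result = top_countries_for_genre(toy, " K-Pop ", n=1)
print(result)
```

An unknown genre yields an empty table, which a caller could check before plotting.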